#116 Allow token processing "middleware" #144

nsantini · 2018-06-13T18:24:51Z

I added node-spellchecker to check for typos, and also "levenshtein" to find the closest spell correction to the original word.
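
For readers unfamiliar with the "closest correction" step, here is a minimal sketch of picking the nearest candidate by Levenshtein distance. It is not the code from this PR: `levenshtein` and `closest` are hypothetical helpers written for illustration, and the candidate list stands in for whatever suggestions the spellchecker returns.

```javascript
// Classic dynamic-programming Levenshtein distance between two strings.
function levenshtein(a, b) {
    // prev[j] is the distance between the first i-1 chars of a and the first j chars of b.
    let prev = [];
    for (let j = 0; j <= b.length; j++) prev.push(j);
    for (let i = 1; i <= a.length; i++) {
        const curr = [i];
        for (let j = 1; j <= b.length; j++) {
            const cost = a[i - 1] === b[j - 1] ? 0 : 1;
            curr.push(Math.min(
                prev[j] + 1,        // deletion
                curr[j - 1] + 1,    // insertion
                prev[j - 1] + cost  // substitution
            ));
        }
        prev = curr;
    }
    return prev[b.length];
}

// Hypothetical helper: return the candidate with the smallest distance to `word`.
// Ties resolve to the earliest candidate; an empty list yields null.
function closest(word, candidates) {
    let best = null;
    let bestDistance = Infinity;
    for (const candidate of candidates) {
        const d = levenshtein(word, candidate);
        if (d < bestDistance) {
            best = candidate;
            bestDistance = d;
        }
    }
    return best;
}
```

With this sketch, `closest('awsome', ['awesome', 'awful', 'some'])` picks `'awesome'`, which is one insertion away.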

I also modified the "negation" feature to look backwards until a negation word or a new afinn word is found, to cover cases like "not too bad".
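
The backward lookup described above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: `NEGATORS` and `LABELS` are tiny made-up word lists, and `isNegated` and `scoreToken` are hypothetical names.

```javascript
// Illustrative word lists; the real library ships its own negators and AFINN labels.
const NEGATORS = new Set(['not', "don't", 'never', "isn't"]);
const LABELS = { bad: -3, good: 3, great: 3 };

function hasLabel(token) {
    return Object.prototype.hasOwnProperty.call(LABELS, token);
}

// Walk backwards from the token before `cursor`. Stop at the first negator
// (negated) or at the first scored word (not negated, because that word
// starts a new sentiment unit). Neutral words such as "too" are skipped.
function isNegated(tokens, cursor) {
    for (let i = cursor - 1; i >= 0; i--) {
        if (NEGATORS.has(tokens[i])) return true;
        if (hasLabel(tokens[i])) return false;
    }
    return false;
}

// Score a single token, inverting the sign when a negator governs it.
function scoreToken(tokens, cursor) {
    const token = tokens[cursor];
    if (!hasLabel(token)) return 0;
    const score = LABELS[token];
    return isNegated(tokens, cursor) ? -score : score;
}
```

For `['not', 'too', 'bad']`, the lookup from "bad" skips "too", reaches "not", and flips the score from -3 to 3.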

nsantini · 2018-06-13T21:39:57Z

lib/index.js

        if (!labels.hasOwnProperty(obj)) continue;
+
+        // Check for negation


When I originally forked the repo, this was where we checked whether the previous word was a negation and inverted the score. Now it seems we deal with negation further down, so we end up negating the score twice. My change to deal with negative words might be redundant now.

elyas-bhy

Hey @nsantini, great work. Just a couple of points:

The spell-checking feature looks good to me. However, we have to consider if we should add an option to enable / disable this feature (are there any use-cases where disabling spell-checking is required? Performance perhaps?). @thisandagain thoughts?
Regarding negation: since PR #128 (Add support for additional languages), this logic has moved to /languages/en/scoring-strategy.js, or more generally /languages/[language]/scoring-strategy.js. The scoring strategy determines what score to assign to a given token, and allows processing negation, emphasis, and any other language specifics.
You are currently performing the negation twice, that is what's making your tests fail.
It would be great to have a few validation tests for the backward lookup feature, there doesn't seem to be any.
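
To make the double-negation point concrete, here is a small sketch. It assumes a strategy object with an `apply(tokens, cursor, tokenScore)` method; that shape, the single negator "not", and both `scoreWith...` functions are assumptions made for illustration rather than code from the repository.

```javascript
// Assumed strategy shape: apply(tokens, cursor, tokenScore) returns the adjusted score.
const scoringStrategy = {
    apply(tokens, cursor, tokenScore) {
        if (cursor > 0 && tokens[cursor - 1] === 'not') return -tokenScore;
        return tokenScore;
    }
};

// Hypothetical core loop that still negates before delegating (the bug).
function scoreWithDuplicateNegation(tokens, labels) {
    let total = 0;
    tokens.forEach((token, cursor) => {
        if (!Object.prototype.hasOwnProperty.call(labels, token)) return;
        let tokenScore = labels[token];
        // Leftover core check inverts the score once here...
        if (cursor > 0 && tokens[cursor - 1] === 'not') tokenScore = -tokenScore;
        // ...and the strategy inverts it again, cancelling the negation.
        total += scoringStrategy.apply(tokens, cursor, tokenScore);
    });
    return total;
}

// Hypothetical core loop that leaves negation entirely to the strategy.
function scoreWithStrategyOnly(tokens, labels) {
    let total = 0;
    tokens.forEach((token, cursor) => {
        if (!Object.prototype.hasOwnProperty.call(labels, token)) return;
        total += scoringStrategy.apply(tokens, cursor, labels[token]);
    });
    return total;
}
```

For `['not', 'good']` with `{ good: 3 }`, the first function yields 3 (negation lost) and the second yields -3.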

nsantini · 2018-06-15T03:34:29Z

Ok

moved the negation backwards strategy to the language model
added unit test for it

nsantini · 2018-06-15T03:44:31Z

Finally, made spell checking optional, and unit tested it

elyas-bhy · 2018-06-15T10:37:12Z

Thanks for the quick update @nsantini, great work!
A few more things:

You seem to have added the distance/spellchecking feature exclusively to the English language (the only officially supported language for now, but other languages can be dynamically registered; see Adding new languages).
What would it take to make this feature accessible to all languages? Is there a way to set a locale for the spellchecker? If so, we can move the distance/spellchecking modules up a level, and make them available to all languages.
Could you please add the necessary documentation to the README file? Documenting the spellcheck flag would be quite helpful.
Tests look good to me. If you manage to get around the first point of this post, consider adding another test to make sure that your changes are also supported for a new language.

We're almost there!

nsantini · 2018-06-15T19:59:11Z

The current implementation uses https://www.npmjs.com/package/spellchecker , which are native bindings to NSSpellChecker, Hunspell, or the Windows 8 Spell Check API, depending on your platform. Windows 7 and below as well as Linux will rely on Hunspell. So it depends on what language you have in your system. Also it depends on those subsystems to support multiple languages.
So, in a way, the spellchecking implementation supports multiple languages.
I added the method signature to the language definition, so other languages could support different implementations of the spellchecker.
So the question would be, do we want all language spellcheckers to be implemented in one way depending on the used library, or allow each language to use what they want?

elyas-bhy · 2018-06-18T20:54:46Z

Does this mean that, for example, if my OS is set to use the Spanish locale, then the spellchecker module would always try to spellcheck with Spanish as the target language, regardless of the actual language of the input? If this is the case, I feel like this would lead to very inconsistent behaviour.

Is there a way to specify to the spellchecker which language to use?
/cc @thisandagain, thoughts?

nsantini · 2018-06-18T21:14:50Z

There is a mechanism to add a dictionary to the spellchecker; I believe that's how a language gets "set".

nsantini · 2018-06-19T00:58:02Z

I have refactored the spell checking functionality to use a different library nspell that can support multiple languages.
Also updated README with examples of how to use spell checking

elyas-bhy · 2018-06-19T10:38:21Z

This is all great work, we're almost there!

If we officially add support to one or more new languages, we would have to duplicate the code for the spellchecking feature (since it is currently located in languages/en).
Do you think you can move the spellchecking / distance logic up a couple of levels and make it part of the core module (lib)? This would make the feature available to all languages.
At the language-level, I think it is best if we only provide the appropriate dictionary, and delegate the rest of the processing to the core module. It would make adding new languages much easier.

For example:

// languages/fr/index.js
module.exports = {
    labels: require('./labels.json'),
    dictionary: require('dictionary-fr'),  // specify the language dictionary here for spellchecking
    scoringStrategy: require('./scoring-strategy')
};

I have noticed a drastic performance hit when comparing before/after this PR:
Before:

sentiment (Latest) - Short  x 561,830 ops/sec ±1.76% (92 runs sampled)
sentiment (Latest) - Long   x 2,689 ops/sec ±1.41% (87 runs sampled)
Sentimental (1.0.1) - Short x 314,373 ops/sec ±1.50% (89 runs sampled)
Sentimental (1.0.1) - Long  x 1,171 ops/sec ±2.00% (88 runs sampled)

After:

sentiment (Latest) - Short  x 239,996 ops/sec ±1.64% (91 runs sampled)
sentiment (Latest) - Long   x 0.21 ops/sec ±3.55% (5 runs sampled)
Sentimental (1.0.1) - Short x 236,762 ops/sec ±5.24% (82 runs sampled)
Sentimental (1.0.1) - Long  x 895 ops/sec ±2.54% (80 runs sampled)

Ideally there should be barely any impact when the spellchecking feature is disabled (which it should be, by default). Could you please investigate this issue?
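
One way to keep the disabled path cheap is to build the spellchecker lazily and only when the option is set. The sketch below is a guess at the general approach, not what the PR does: the `spellCheck` option name, `buildSpellchecker`, and `countMisspelled` are all hypothetical, and the stub checker stands in for an expensive dictionary load.

```javascript
// Hypothetical factory standing in for an expensive spellchecker setup.
// buildCount records how often the expensive path actually runs.
let buildCount = 0;
function buildSpellchecker() {
    buildCount++;
    const known = new Set(['good', 'bad']);
    return { correct: (word) => known.has(word) };
}

// Build once on first use, then reuse the cached instance.
let cachedChecker = null;
function getSpellchecker() {
    if (cachedChecker === null) cachedChecker = buildSpellchecker();
    return cachedChecker;
}

// Count tokens the checker rejects. Spell checking is off unless
// options.spellCheck is truthy, so the default path never touches the checker.
function countMisspelled(tokens, options) {
    if (!(options && options.spellCheck)) return 0;
    const checker = getSpellchecker();
    return tokens.filter((token) => !checker.correct(token)).length;
}
```

Calling `countMisspelled(tokens)` without the flag returns immediately and leaves `buildCount` at 0, so callers who never enable the feature pay nothing for it.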

I have also noticed a slight decrease in code coverage (npm run test:coverage). Could you add a few more tests to cover those cases?

And finally, do you mind documenting this feature in the API Reference section of the README file as well?

Thank you for your patience @nsantini and for bearing with me!

thisandagain · 2018-06-19T13:50:49Z

In addition to the performance implications of this change necessitating the feature be "off" by default, I think the validation tests also strongly suggest that this should be optional:

Before

Amazon accuracy: 0.7202797202797203
IMDB accuracy: 0.7642357642357642
Yelp accuracy: 0.6943056943056943

After

Amazon accuracy: 0.7302697302697303
IMDB accuracy: 0.7532467532467533
Yelp accuracy: 0.6943056943056943

Nice improvement (~1%) on the Amazon validation set, but the IMDB accuracy dropped ~1.1% and Yelp stayed stable. This may suggest that the change is indeed helping with less formal speech, but is actually causing false positives in a more formal corpus (or vice versa), but more investigation would be required.
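
The quoted changes can be checked directly from the numbers above. This snippet only redoes the arithmetic, expressed in percentage points; `deltaInPoints` is a helper name invented for this note.

```javascript
// Accuracy figures copied from the Before/After validation runs above.
const before = { Amazon: 0.7202797202797203, IMDB: 0.7642357642357642, Yelp: 0.6943056943056943 };
const after = { Amazon: 0.7302697302697303, IMDB: 0.7532467532467533, Yelp: 0.6943056943056943 };

// Change in accuracy for one data set, in percentage points, rounded to 2 decimals.
function deltaInPoints(name) {
    return Number(((after[name] - before[name]) * 100).toFixed(2));
}

for (const name of Object.keys(before)) {
    console.log(name, deltaInPoints(name));
}
```

The deltas come out as about +1.0 points for Amazon, about -1.1 points for IMDB, and 0 for Yelp.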

thisandagain

Looking good @nsantini! This is great to see. Agreed with @elyas-bhy's suggestions above. I also would love to further discuss the edit distance strategy and validation test differences as described in the comments.

thisandagain · 2018-06-19T13:55:47Z

languages/en/distance.js

+
+/**
+ * Finds the closest match between a statement and a body of words using
+ * Levenshtein Distance


Levenshtein Distance is a great performant strategy, but if we are going to make this optional anyway we might want to discuss / test alternative edit distance algorithms (e.g. Myers Diff Algorithm).

PDF:
myers.pdf

nsantini · 2018-06-19T21:13:11Z

Ok, moved the spell checking to the library and left loading the dictionary to the language. Added extra unit tests; only two lines of the spell checking are not covered, but they are there as a safeguard for an edge case that shouldn't happen.
Looking into performance and accuracy:

sentiment (Latest) - Short x 493,203 ops/sec ±1.14% (90 runs sampled)
sentiment (Latest) - Long x 1,452 ops/sec ±3.65% (82 runs sampled)
Sentimental (1.0.1) - Short x 306,074 ops/sec ±2.98% (86 runs sampled)
Sentimental (1.0.1) - Long x 1,283 ops/sec ±2.45% (87 runs sampled)

Amazon accuracy: 0.7302697302697303
IMDB accuracy: 0.7532467532467533
Yelp accuracy: 0.6943056943056943

The changes to use spell checking, off by default, should not have affected accuracy, since I haven't changed how it is tested.

elyas-bhy · 2018-06-20T11:39:29Z

languages/en/dictionary.js

+    if (dictiaonary ===  null) {
+        var base = require.resolve('dictionary-en-us');
+        dictiaonary = {
+            'aff': read(base.replace('.js', '.aff'), 'utf-8'),


Could you explain why we need to perform such gymnastics when loading the dictionary, instead of simply requiring it and passing it over to nspell, as described in their usage example?

Also, there is a typo in the spelling of the "dictionary" variable.

Fixed the typo.

Regarding "gymnastics": the dictionary can be loaded in an async fashion, but the whole sentiment library works synchronously. I decided not to pollute the whole library with callbacks or promises, and given that I couldn't use async/await to do it (eslint complained about it), I decided to load it this way.
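
The synchronous loading described here can be sketched as below. It mirrors the idea in the diff (derive the `.aff` and `.dic` paths from the resolved `.js` entry file and read them with blocking calls), but `loadDictionarySync` and `getDictionary` as written are illustrative and assume both files sit next to the entry file with the same base name.

```javascript
const fs = require('fs');

// Read the .aff (affix rules) and .dic (word list) files that sit next to
// `entryPath`, e.g. a path obtained from require.resolve('dictionary-en-us').
function loadDictionarySync(entryPath) {
    return {
        aff: fs.readFileSync(entryPath.replace(/\.js$/, '.aff'), 'utf-8'),
        dic: fs.readFileSync(entryPath.replace(/\.js$/, '.dic'), 'utf-8')
    };
}

// Cache the result so the blocking reads happen at most once per process.
let dictionary = null;
function getDictionary(entryPath) {
    if (dictionary === null) {
        dictionary = loadDictionarySync(entryPath);
    }
    return dictionary;
}
```

The trade-off is that the first call blocks on two file reads, in exchange for keeping the public API free of callbacks and promises.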

elyas-bhy · 2018-06-20T11:41:15Z

README.md

+  getDictionary: {
+    apply: function() {
+      // Load a dictionary for the language for nspell to use
+      return { aff, dic };


What do these aff and dic properties correspond to? Could you give an example?

Explained more in the README file

This doesn't seem to be the case, are you sure that you have pushed your changes?

new commit, it references the explanation I gave in the "Spell checked example" section

nsantini · 2018-06-22T19:43:12Z

Hi, latest changes:

README references the documentation of nspell and how to add dictionaries

elyas-bhy · 2018-06-22T20:32:43Z

LGTM. @thisandagain?

nsantini · 2018-06-25T22:10:41Z

@thisandagain any more comments on this?

Regarding your comment about looking for alternative string distance algorithms, I would suggest leaving this PR as it is (since the current implementation is performant and correct) and opening a new issue to look for alternatives.

nsantini · 2018-07-13T01:27:04Z

@thisandagain any calls regarding this PR?

nsantini · 2018-07-26T21:03:46Z

@elyas-bhy @thisandagain any updates?

elyas-bhy · 2018-07-26T21:05:21Z

Everything looks good on my side. Still waiting for @thisandagain to approve the changes.

nsantini · 2018-09-04T00:34:43Z

Hi @thisandagain, just checking if you got a chance to review the changes.

elyas-bhy · 2018-09-13T10:48:40Z

Ping @thisandagain.

nsantini · 2019-02-17T19:44:13Z

Should I close this PR?
@thisandagain @elyas-bhy

elyas-bhy · 2019-02-18T10:43:10Z

I'm really looking forward to getting this merged, but I need @thisandagain to approve and merge this PR.

elyas-bhy · 2020-02-28T08:00:34Z

Ping @thisandagain

nsantini added 6 commits September 12, 2017 13:09

spell checking words

588de8d

using Levenshtein distance to spell check

90417ad

checking if word is mispelled before correcting it

131e4b8

checking for negating words backwards until end of token or afinn wor…

9d5218f

…d found

spell checking negation words withouth afinn

6a0521f

refactorign into files, update unit test to match new findings

08409bf

nsantini mentioned this pull request Jun 13, 2018

Allow token processing "middleware" #116

Open

nsantini added 2 commits June 14, 2018 06:37

Merging upstream to resolve PR conflicts

7e7dc9e

fixing variable rename

58b1a70

thisandagain requested review from thisandagain and elyas-bhy June 13, 2018 21:03

thisandagain assigned thisandagain and elyas-bhy Jun 13, 2018

thisandagain added the pr - needs review label Jun 13, 2018

nsantini commented Jun 13, 2018

elyas-bhy requested changes Jun 14, 2018

nsantini added 2 commits June 15, 2018 15:29

moving negation strategy to language module

52e62cf

adding unit test for backward search for negation

d08dfc2

making spell check optional and adding unit test

aaacd29

nsantini added 2 commits June 19, 2018 12:50

using nspell library for spell checking, and loading dictionary synch…

ecc5de2

…ronously

Adding readme section about spell checking

6a298f7

thisandagain added the pr - needs work label Jun 19, 2018

thisandagain assigned nsantini and unassigned elyas-bhy Jun 19, 2018

thisandagain requested changes Jun 19, 2018

nsantini added 2 commits June 20, 2018 09:04

Making spell checking available for all languages

1f03b7d

Documenting API for spell checking

ef9ecda

elyas-bhy reviewed Jun 20, 2018

nsantini added 2 commits June 21, 2018 06:35

typo fixed plus readme enhanced about dictioinaries

d22e26f

more specific documentation around dictionaries

9d25b4b

elyas-bhy approved these changes Jun 26, 2018

elyas-bhy requested review from thisandagain and elyas-bhy August 22, 2019 15:49

elyas-bhy approved these changes Aug 22, 2019

thisandagain added pr - needs review and removed pr - needs work labels Aug 28, 2019

thisandagain assigned thisandagain and unassigned nsantini Aug 28, 2019
